Michael Clark
Seek to find orthogonal factors to minimize reconstruction error.
\[\mathscr{Loss} = X - ZW^T\]
X is the \(NxD\) matrix of observed scores, \(Z\) are the \(N x L\) component scores where L < D, while \(W\) are the \(DxL\) weights, typically referred to as the factor loading matrix.
Each component, or factor, created accounts for less of the variance in the original data. With all components,
\[X = ZW^T\]
The MNIST data contains 28 by 28 pixel images of digits 0-9. If we unroll the data to 784 columns, each row represents a single digit. We can see in the following how well we can reconstruct a digit via PCA. Even with only two components we can get a sense of what we’re looking at. With all components the reconstruction is perfect.
Let’s see an example with more digestible results. The following data are 16 multiple choice ability items taken from the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. There are 1525 total subjects, and four categories of questions.
library(psych)
cog_ability = psych::iqitemsLet’s do a PCA. We’ll look at four components, though remember that PCA technically returns as many components as there are variables. You’re just seeing part of the results. Note that PCA almost universally requires you to standardize the data, so that each variable has the same variance. Otherwise you’ll just get components reflecting the relative variances. In what follows the PCA is done on the correlation matrix, which amounts to the same thing.
pc = principal(cog_ability, 4)
pcPrincipal Components Analysis
Call: principal(r = cog_ability, nfactors = 4)
Standardized loadings (pattern matrix) based upon correlation matrix
RC1 RC2 RC3 RC4 h2 u2 com
reason.4 0.60 0.01 0.14 0.09 0.38 0.62 1.2
reason.16 0.67 0.07 0.11 0.04 0.47 0.53 1.1
reason.17 0.50 0.26 -0.08 0.11 0.33 0.67 1.7
reason.19 0.64 0.02 0.14 0.04 0.43 0.57 1.1
letter.7 0.71 0.05 0.02 -0.01 0.51 0.49 1.0
letter.33 0.35 -0.06 0.06 0.58 0.46 0.54 1.7
letter.34 0.64 0.04 0.06 -0.03 0.42 0.58 1.0
letter.58 0.43 -0.17 0.07 0.42 0.40 0.60 2.4
matrix.45 0.64 0.11 0.08 0.10 0.43 0.57 1.1
matrix.46 -0.11 0.26 0.08 0.61 0.46 0.54 1.5
matrix.47 -0.02 0.17 0.06 0.66 0.46 0.54 1.2
matrix.55 0.32 0.32 0.11 0.11 0.23 0.77 2.5
rotate.3 0.09 0.72 0.27 0.12 0.61 0.39 1.4
rotate.4 0.09 0.81 -0.12 0.10 0.70 0.30 1.1
rotate.6 0.17 -0.06 0.84 0.06 0.74 0.26 1.1
rotate.8 0.14 0.22 0.77 0.14 0.68 0.32 1.3
RC1 RC2 RC3 RC4
SS loadings 3.28 1.55 1.49 1.40
Proportion Var 0.20 0.10 0.09 0.09
Cumulative Var 0.20 0.30 0.39 0.48
Proportion Explained 0.42 0.20 0.19 0.18
Cumulative Proportion 0.42 0.63 0.82 1.00
Mean item complexity = 1.4
Test of the hypothesis that 4 components are sufficient.
The root mean square of the residuals (RMSR) is 0.07
with the empirical chi square 2033.18 with prob < 0
Fit based upon off diagonal values = 0.87
First focus on the portion of the output where it says SS loadings . The first line is the sum of the squared loadings1 for each component (in this case where we are using a correlation matrix as the basis, summing across all 16 possible components would equal the value of 16). The Proportion Var tells us how much of the overall variance the component accounts for out of all the variables (e.g. 3.28 / 16 = 0.2). The Cumulative Var tells us that 4 components make up over 48% the variance. The others are the same thing just based on the 4 retained components rather than all 16 variables (i.e. the cumulative explained variance would = 1). We can see that the second component accounts for less variance, and this would continue with additional components, where each accounts for a decreasing amount of variance.
Some explanation of the other parts of the output:
h2: the amount of variance in the item/variable explained by the (retained) components. It is the sum of the squared loadings, a.k.a. communality. For example, the first reasoning item shares only 38% of its variance with these components.u2: 1 - h2com: A measure of complexity. A value of 1 might be seen for something that loaded on only one component, and zero otherwise (a.k.a. perfect simple structure).Loadings, also referred to as the pattern matrix, in this scenario represent the estimated correlation of an item with its component, and provide the key way in which we interpret the factors. As an example, we can reproduce the loadings by correlating the observed variables with the estimated component scores2.
cor(cog_ability, pc$scores, use = 'pair') %>% round(2) RC1 RC2 RC3 RC4
reason.4 0.60 0.01 0.14 0.09
reason.16 0.67 0.07 0.11 0.04
reason.17 0.50 0.26 -0.08 0.11
reason.19 0.64 0.02 0.14 0.04
letter.7 0.71 0.05 0.02 -0.01
letter.33 0.35 -0.06 0.06 0.58
letter.34 0.64 0.04 0.06 -0.03
letter.58 0.43 -0.17 0.07 0.42
matrix.45 0.64 0.11 0.08 0.10
matrix.46 -0.11 0.26 0.08 0.61
matrix.47 -0.02 0.17 0.06 0.66
matrix.55 0.32 0.32 0.11 0.11
rotate.3 0.09 0.72 0.27 0.12
rotate.4 0.09 0.81 -0.12 0.10
rotate.6 0.17 -0.06 0.84 0.06
rotate.8 0.14 0.21 0.77 0.14
It can be difficult to sort it out just by looking at the values, so we’ll look at it visually. In the following plot, stronger loadings are indicated by blue, and we can see the different variables associated with different components.
Interpretation is the fun, but commonly difficult, part. In this case, the variables don’t appear to be grouping like we’d expect, except for some of the reasoning scores3. It’s worth mentioning the naming fallacy at this point. Just because we associate a factor with some concept, doesn’t make it so. In addition, for PCA the goal is not interpretation, but to reduce the data while retaining as much of the variance as possible.
Why do it?
Issues
Other stuff
\[X \approx ZW^T\]
fa_model = fa(cog_ability, 4, rotate='oblimin')
fa_modelFactor Analysis using method = minres
Call: fa(r = cog_ability, nfactors = 4, rotate = "oblimin")
Standardized loadings (pattern matrix) based upon correlation matrix
MR1 MR3 MR2 MR4 h2 u2 com
reason.4 0.53 0.04 -0.04 0.04 0.30 0.70 1.0
reason.16 0.61 0.03 0.03 -0.04 0.40 0.60 1.0
reason.17 0.43 -0.05 0.11 0.08 0.22 0.78 1.2
reason.19 0.57 0.04 -0.03 0.00 0.34 0.66 1.0
letter.7 0.67 -0.03 0.03 -0.09 0.43 0.57 1.0
letter.33 0.33 0.02 -0.04 0.27 0.21 0.79 2.0
letter.34 0.57 0.00 0.00 -0.05 0.32 0.68 1.0
letter.58 0.38 0.03 -0.07 0.14 0.19 0.81 1.4
matrix.45 0.58 0.00 0.00 0.07 0.36 0.64 1.0
matrix.46 -0.05 0.01 0.04 0.49 0.25 0.75 1.0
matrix.47 0.03 0.03 0.05 0.37 0.17 0.83 1.1
matrix.55 0.27 0.04 0.09 0.15 0.15 0.85 1.8
rotate.3 0.00 0.25 0.46 0.08 0.36 0.64 1.6
rotate.4 0.02 -0.04 0.80 0.00 0.64 0.36 1.0
rotate.6 0.05 0.65 -0.11 -0.03 0.43 0.57 1.1
rotate.8 -0.01 0.68 0.08 0.04 0.49 0.51 1.0
MR1 MR3 MR2 MR4
SS loadings 2.68 1.01 0.95 0.60
Proportion Var 0.17 0.06 0.06 0.04
Cumulative Var 0.17 0.23 0.29 0.33
Proportion Explained 0.51 0.19 0.18 0.11
Cumulative Proportion 0.51 0.70 0.89 1.00
With factor correlations of
MR1 MR3 MR2 MR4
MR1 1.00 0.40 0.19 0.16
MR3 0.40 1.00 0.12 0.37
MR2 0.19 0.12 1.00 0.38
MR4 0.16 0.37 0.38 1.00
Mean item complexity = 1.2
Test of the hypothesis that 4 factors are sufficient.
The degrees of freedom for the null model are 120 and the objective function was 2.6 with Chi Square of 3944.59
The degrees of freedom for the model are 62 and the objective function was 0.06
The root mean square of the residuals (RMSR) is 0.02
The df corrected root mean square of the residuals is 0.02
The harmonic number of observations is 1523 with the empirical chi square 100.71 with prob < 0.0014
The total number of observations was 1525 with Likelihood Chi Square = 91.43 with prob < 0.0089
Tucker Lewis Index of factoring reliability = 0.985
RMSEA index = 0.018 and the 90 % confidence intervals are 0.009 0.025
BIC = -363.01
Fit based upon off diagonal values = 0.99
Measures of factor score adequacy
MR1 MR3 MR2 MR4
Correlation of (regression) scores with factors 0.90 0.82 0.84 0.69
Multiple R square of scores with factors 0.80 0.67 0.70 0.48
Minimum correlation of possible factor scores 0.61 0.33 0.40 -0.04
modelCode = "
verbal =~ reason.4 + reason.16 + reason.17 + reason.19
spatial =~ rotate.3 + rotate.4 + rotate.6 + rotate.8
# not necessary
# verbal ~~ spatial
# if you don't want them correlated
# verbal ~~ 0*spatial
"
famod = cfa(modelCode, data=cog_ability)
summary(famod, standardized=T, rsq=T, nd=2)
pars = parameterEstimates(famod, standardized=T)\[ X \sim \mathcal{N}(ZW^T + \mu, \Psi) \] \(\mu\) are the intercepts, \(\Psi\) is a \(DxD\) covariance matrix.
For probabilistic PCA \(\Psi\) is \(\sigma^2I\).
For standard PCA, \(\sigma \rightarrow 0\),
Commonly our data regards counts or otherwise compositional data. In this case we can use techniques that are better suited to such data.
Non-negative matrix factorization is similar to PCA, just with the constraint that scores and weights be positive. It is (in the usual implementation) identical to probabilistic latent semantic analysis.
Latent Dirichlet Allocation takes a different approach to deal with count data. The classic example regards text analysis, where LDA is applied to word frequencies across topics. Consider a document-term matrix where each row regards a document, and each column a word or term. LDA looks for latent topics that are probability distributions over the terms. There is a probability distribution for the topics as well as the terms, and given both, we will see some occurrence of the term in each document. Each document can be seen a mix of topics.
In the factor analysis sense, each variable (the term) will have some probability of occurring for a given topic, i.e. factor. One can think of the term/variable probabilities for a specific topic similar to how we did loadings in the factor analysis sense. Every variable will have some non-zero probability of occurrence for a given factor, but often they are essentially zero. Factor scores are the document topic probabilities.
Given the probabilities of topics and terms within topics, there is a multinomial draw of counts for an observation. With this probabilistic result, there isn’t a ‘reconstruction’ connotation with LDA. However we can simulate a mulinomial draw based on the total counts we see. For example, let’s assume 5 terms seen a total of 20 times given some probability.
t(rmultinom(1, size = 20, prob=5:1)) # first term is most likely, 5th term is least likely [,1] [,2] [,3] [,4] [,5]
[1,] 4 7 5 2 2
Those are the counts we see for a given observation.
LDA wouldn’t be the best tool if your goal is prediction, and you don’t care much about interpretation.
LDA is basically NMF with Dirichlet priors on the topics.
They are the eigenvalues of the correlation matrix. In addition, they are the diagonal of the crossproduct of the loading matrix.↩
These results have been rotated, something practically no one using PCA actually does.↩
Results like this are more the rule than the exception in my experience. Not enough initial development is typically done with scales, and when other people use them in other settings, it often ends with disappointment. Just think, many psychological scales are developed on freshman psychology students. How well do you think that will generalize?↩